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Abstract: This research deals with massive multiple hypothesis testing. First 
regarding multiple tests as an estimation problem under a proper population 
model, an error measurement called Erroneous Rejection Ratio (ERR) is in- 
troduced and related to the False Discovery Rate (FDR). ERR is an error 
measurement similar in spirit to FDR, and it greatly simplifies the analytical 
study of error properties of multiple test procedures. Next an improved estima- 
tor of the proportion of true null hypotheses and a data adaptive significance 
threshold criterion are developed. Some asymptotic error properties of the sig- 
nificant threshold criterion is established in terms of ERR under distributional 
assumptions widely satisfied in recent applications. A simulation study pro- 
vides clear evidence that the proposed estimator of the proportion of true null 
hypotheses outperforms the existing estimators of this important parameter 
in massive multiple tests. Both analytical and simulation studies indicate that 
the proposed significance threshold criterion can provide a reasonable balance 
between the amounts of false positive and false negative errors, thereby com- 
plementing and extending the various FDR control procedures. S-plus/R code 
is available from the author upon request. 



1. Introduction 

The recent advancement of biological and information technologies made it possi- 
ble to generate unprecedented large amounts of data for just a single study. For 
example, in a genome-wide investigation, expressions of tens of thousands genes 
and markers can be generated and surveyed simultaneously for their association 
with certain traits or biological conditions of interest. Statistical analysis in such 
applications poses a massive multiple hypothesis testing problem. The traditional 
approaches to controlling the probability of family-wise type-I error have proven to 
be too conservative in such applications. Recent attention has been focused on the 
control of false discovery rate (FDR) introduced by Benjamini and Hochberg [4]. 
Most of the recent methods can be broadly characterized into several approaches. 
Mixture-distribution partitioning [2, 24, 25] views the P values as random variables 
and models the P value distribution to generate estimates of the FDR levels at vari- 
ous significance levels. Significance analysis of microarrays (SAM; [32, 35]) employs 
permutation tests to inference simultaneously on order statistics. Empirical Baysian 
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approaches include for example [10, 11, 17, 23, 28]. Tsai et al. [34] proposed models 
and estimators of the conditional FDR, and Bickel [6] takes a decision-theoretic 
approach. Recent theoretical developments on FDR control include Genovese and 
Wasserman [13, 14], Storey et al. [31], Finner and Roberts [12], and Abramovich et 
al. [1]. Recent theoretical development on control of generalized family- wise type-I 
error includes van der Laan et al. [36, 37], Dudoit et al. [9], and the references 
therein. 

Benjamini and Hochberg [4] argue that as an alternative to the family-wise type-I 
error probability FDR is a proper measurement of the amount of false positive 
errors, and it enjoys many desirable properties not possessed by other intuitive 
or heuristic measurements. Furthermore they develop a procedure to generate a 
significance threshold (P value cutoff) that guarantees the control of FDR under a 
pre-specified level. Similar to a significance test, FDR control requires one to specify 
a control level a priori. Storey [29] takes the point of view that in discovery-oriented 
applications neither the FDR control level nor the significance threshold may be 
specified before one sees the data (P values), and often the significance threshold 
is so determined a posteriori that allows for some "discoveries" (rejecting one or 
more null hypotheses). These "discoveries" are then scrutinized in confirmation 
and validation studies. Therefore it would be more appropriate to measure the 
false positive errors conditional on having rejected some null hypotheses, and for 
this purpose the positive FDR (pFDR; Storey [29]) is a meaningful measurement. 
Storey [29] introduces estimators of FDR and pFDR, and the concept of g-value 
which is essentially a neat representation of Benjamini and Hochberg's ([4]) step- 
up procedure possessing a Bayesian interpretation as the posterior probability of 
the null hypothesis ([30]). Reiner et al. [26] introduce the "FDR-adjusted P value" 
which is equivalent to the q- value. The q- value plot ([33]) allows for visualization 
of FDR (or pFDR) levels in relationship to significance thresholds or numbers of 
null hypotheses to reject. Other closely related procedures are the adaptive FDR 
control by Benjamini and Hochberg [3], and the recent two-stage linear step-up 
procedure by Benjamini et al. [5] which is shown to provide sure FDR control at 
any pre-specified level. 

In discovery-oriented exploratory studies such as genome-wide gene expression 
survey or association rule mining in marketing applications, it is desirable to strike 
a meaningful balance between the amounts false positive and false negative errors 
than to control the FDR or pFDR alone. Cheng et al. [7] argue that it is not 
always clear in practice how to specify the threshold for either the FDR level or the 
significance level. Therefore, additional statistical guidelines beyond FDR control 
procedures are desirable. Genovese and Wasserman [13] extend FDR control to a 
minimization of the "false nondiscovery rate" (FNR) under a penalty of the FDR, 
i.e., FNR+XFDR, where the penalty A is assumed to be specified a priori. Cheng et 
al. [7] propose to extract more information from the data (P values) and introduce 
three data-driven criteria for determination of the significance threshold. 

This paper has two related goals: (1) develop a more accurate estimator of the 
proportion of true null hypotheses, which is an important parameter in all multiple 
hypothesis testing procedures; and (2) further develop the "profile information crite- 
rion" I.p introduced in [7] by constructing a more data-adaptive criterion and study 
its asymptotic error behavior (as the number of tests tends to infinity) theoretically 
and via simulation. For theoretical and methodological development, a new mean- 
ingful measurement of the quantity of false positive errors, the erroneous rejection 
ratio (ERR), is introduced. Just like FDR, ERR is equal to the family- wise type-I 
error probability when all null hypotheses are true. Under the ergodicity conditions 
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used in recent studies ([14, 31]), ERR is equal to FDR at any significant threshold 
(P value cut-off). On the other hand, ERR is much easier to handle analytically 
than FDR under distributional assumptions more widely satisfied in applications. 
Careful examination of each component in ERR gives insights into massive multi- 
ple testing in terms of the ensemble behavior of the P values. Quantities derived 
from ERR suggest to construct improved estimators of the null proportion (or the 
number of true null hypotheses) considered in [3, 29, 31], and the construction of an 
adaptive significance threshold criterion. The theoretical results demonstrate how 
the criterion can be calibrated with the Bonferroni adjustment to provide control of 
family-wise type-I error probability when all null hypotheses are true, and how the 
criterion behaves asymptotically, giving cautions and remedies in practice. The sim- 
ulation results are consistent with the theory, and demonstrate that the proposed 
adaptive significance criterion is a useful and effective procedure complement to the 
popular FDR control methods. 

This paper is organized as follows: Section 2 contains a brief review of FDR 
and the introduction of ERR; section 3 contains a brief review of the estimation 
of the proportion of null hypotheses, and the development of an improved estima- 
tor; section 4 develops the adaptive significance threshold criterion and studies its 
asymptotic error behavior (as the number of hypotheses tends to infinity) under 
proper distributional assumptions on the P values; section 5 contains a simulation 
study; and section 6 contains concluding remarks. 

Notation. Henceforth, R denotes the real line; M. k denotes the k dimensional Eu- 
clidean space. The symbol || • |j p denotes the L p or l v norm, and := indicates equal 
by definition. Convergence and convergence in probability are denoted by — ► and 
— > p respectively. A random variable is usually denoted by an upper-case letter 
such as P, i?, V, etc. A cumulative distribution function (cdf) is usually denoted 
by F, G or H; an empirical distribution function (EDF) is usually indicated by 
a tilde, e.g., F. A population parameter is usually denoted by a lower-case Greek 
letter and a hat indicates an estimator of the parameter, e.g., 8. Equivalence is 
denoted by ~, e.g., "o„ ~ b n as n — ► oo" means lim„ >oc a n /b n = 1. 

2. False discovery rate and erroneous rejection ratio 

Consider testing m hypothesis pairs (H i, Hai), i = 1, . . . , to. In many recent appli- 
cations such as analysis of microarray gene differential expressions, m is typically 
on the order of 10 5 . Suppose to P values, Pi, ... , P m , one for each hypothesis pair, 
are calculated, and a decision on whether to reject H i is to be made. Let Too be the 
number of true null hypotheses, and let mi := m — too be the number of true alter- 
native hypotheses. The outcome of testing these to hypotheses can be tabulated as 
in Table 1 (Bcnjamini and Hochberg [4]), where V is the number of null hypotheses 
erroneously rejected, S is the number of alternative hypotheses correctly captured, 
and R is the total number of rejections. 

Clearly only m is known and only R is observable. At least one family-wise 
type-I error is committed if V > 0, and procedures for multiple hypothesis testing 
have traditionally been produced for solely controlling the family-wise type-I error 
probability Pr(V > 0). It is well-known that such procedures often lack statistical 
power. In an effort to develop more powerful procedures, Benjamini and Hochberg 
([4]) approached the multiple testing problem from a different perspective and in- 
troduced the concept of false discovery rate (FDR), which is, loosely speaking, the 
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expected value of the ratio V / R. They introduced a simple and effective procedure 
for controlling the FDR under any pre-specified level. 

It is convenient both conceptually and notationally to regard multiple hypotheses 
testing as an estimation problem ([7]). Define the parameter = [9\,. .. ,9 m ] as 
Oi = 1 if Hai is true, and 6i = if H M is true (i = 1, . . . , to). The data consist of 
the P values {Pi, . . . , Pm}, and under the assumption that each test is exact and 
unbiased, the population is described by the following probability model: 

Pi~Vi,o t ; 

(2.1) V ifi is 17(0,1), and p iA < st U(0, 1); 

each Vi y \ has a twice continuously differentiable cdf P(-), 

for i = 1,...,to, where < st stands for "stochastically less than." The P values 
are dependent in general and have a joint distribution on R m . By this model the 
marginal cdf of Pi can be written as Gi(t) = (1 — 9i)t + 9iFi(t). Note Fi(t) > t and 
Gi(t) >t forte [0,1]. 

A rejection procedure is an estimator of 0: 9 = @(Pi, ■ ■ . , P m ) = [9i, ■ ■ ■ , 9 m ] £ 
{0, l} m , where $i = 1 indicates rejecting H oi in favor of Hai 1 i = 1, . . . , to. With 
this notation, the random variables in Table 1 can be expressed as 

m mm 

(2.2) V = Ve(e) = 5^(l-0i)0j; 5=Se(e) = ^^; P=fi(6) = ^^. 

i— 1 i— 1 i— 1 

A natural and perhaps the simplest procedure is the "hard-thresholding" (HT) 
estimator = 0(a) defined as 

(2.3) HT(a) : 9i = 1 iff P 4 < a, 

where a <E (0, 1) is a significance threshold common to all hypotheses. Clearly for 
this procedure the random variables V, S, and R all depend on a. Traditional 
control of family-wise type-I error probability seeks to determine a so that Pr(V > 
0) < a* for pre-specified a* . Genovese and Wasserman [14] list several procedures to 
determine a. Benjamini and Hochberg [4] introduce a simple procedure to determine 
a so that the FDR is controlled at a given level. 



2.1. False discovery rate and its control 

The FDR as defined by Benjamini and Hochberg ([4]) can be expressed as 

£™ i£(i-0*) 



(2.4) 



FDR e (0) = E 



££ift+n£i(i-*i) 



which is equivalent to E[V / R\R > 0] Pr(i? > 0). Let P 1:m < P 2: m <■■■ < Pm-.m be 
the order statistics of the P values, and let tt = toq/to. Benjamini and Hochberg 
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([4]) prove that for any specified q* <G (0, f) rejecting all the null hypotheses cor- 
responding to Pi-.m, . . . , Pk'-.m with k* — max{fc : Pk :m / (k/m) < q*} controls the 
FDR at the level ir q*, i.e., FDRQ(0(P k , :rn )) < -K q* < q*. Note this procedure is 
equivalent to applying the data-driven threshold a = Pk*-. m to all P values in l|2.3|) . 
i.e., HT(P k . :m ). 

Recognizing the potential of constructing less conservative FDR controls by the 
above procedure, Benjamini and Hochberg ([3]) propose an estimator of too, too, 
(hence an estimator of ttq, ttq — rho/m), and replace k/m by k/fho in determining 
k* . They call this procedure "adaptive FDR control." The estimator 7?o = fho/rn 
will be discussed in Section 3. A recent development in adaptive FDR control can 
be found in Benjamini et al. [5]. 

Similar to a significance test, the above procedure requires the specification of 
an FDR control level q* before the analysis is conducted. Storey ([29]) takes the 
point of view that for more discovery-oriented applications the FDR level is not 
specified a priori, but rather determined after one sees the data (P values), and 
it is often determined in a way allowing for some "discovery" (rejecting one or 
more null hypotheses). Hence a concept similar to, but different than FDR, the 
positive false discovery rate (pFDR) E \j/ / R\R > 0] , is more appropriate. Storey 
([29]) introduces estimators of ttq, the FDR, and the pFDR from which the q-values 
are constructed for FDR control. Storey et al. ([31]) demonstrate certain desirable 
asymptotic conservativeness of the q-values under a set of ergodicity conditions. 



2.2. Erroneous rejection ratio 

As discussed in [3, 4], the FDR criterion has many desirable properties not pos- 
sessed by other intuitive alternative criteria for multiple tests. In order to obtain 
an analytically convenient expression of FDR for more in-depth investigations and 
extensions, such as in [13, 14, 29, 31], certain fairly strong ergodicity conditions 
have to be assumed. These conditions make it possible to apply classical empiri- 
cal process methods to the "FDR process." However, these conditions may be too 
strong for more recent applications, such as genome-wide tests for gene expression- 
phenotype association using microarrays, in which a substantial proportion of the 
tests can be strongly dependent. In such applications it may not be even reasonable 
to assume that the tests corresponding to the true null hypotheses are independent, 
an assumption often used in FDR research. Without these assumptions however, 
the FDR becomes difficult to handle analytically. An alternative error measurement 
in the same spirit of FDR but easier to handle analytically is defined below. 
Define the erroneous rejection ratio (ERR) as 

(2.5) ERRq(Q) = E ^ Ve ^ Pt(R(@) > 0). 

E[R(Q)] 

Just like FDR, when all null hypotheses are true ERR = Pr(i?(9) > 0), which is the 
family-wise type-I error probability because now Ve(O) = R(0) with probability 
one. Denote by V(a) and R(a) respectively the V and R random variables and by 
ERR(a) the ERR for the hard-thresholding procedure HT(a); thus 
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Careful examination of each component in ERR(a) reveals insights into multiple 
tests in terms of the ensemble behavior of the P values. Note 

E[V(a)} = i(l - Oi) Pr(?, = 1) = m Q a 

E[R(a)] = YZi = 1) = + Ej-.o^i F i(<*) 

Pv(R(a) > 0) = Pr(P 1:m < a). 

Define H m (t) := mf 1 E^.=i ^(*) and F m (t) := m" 1 Y7=i = *ot + (1 - 

ir )H m (t). Then 

(2.7) £itf2(a) = Pr(P 1:m < a). 

The functions H m (-) and F m (-) both are cdf's on [0, 1]; -ff m is the average of the 
P value marginal cdf's corresponding to the true alternative hypotheses, and F m 
is the average of all P value marginal cdf's. F m describes the ensemble behavior of 
all P values and H rn describes the ensemble behavior of the P values corresponding 
to the true alternative hypotheses. Cheng et al. ([7]) observe that the EDF of the 
P values F m (t) := mT 1 Y^T=i ^(-^» < t), t e M. is an unbiased estimator of F m (-), 
and if the tests 9 (i = 1, ...,m) arc not strongly correlated asymptotically in 
the sense that Ei=^ Cov(6i,6j) = o(m 2 ) as m — ► oo, F m (-) is "asymptotically 
consistent" for F m in the sense that \F m (t) — F m (t)\ — > p for every teR. This 
prompts possibilities for the estimation of 7ro, data- adaptive determination of a for 
the HT(a) procedure, and the estimation of FDR. The first two will be developed 
in detail in subsequent sections. Cheng et al. ([7]) and Pounds and Cheng ([25]) 
develop smooth FDR estimators. 

Let FDR(a) := E[V (a) / R(a)\R(a) > 0] Pi(R(a) > 0). ERR(a) is essentially 
FDR(a). Under the hierarchical (or random effect) model employed in several 
papers ([11, 14, 29, 31]), the two quantities are equivalent, that is, FDR(a) = 
ERR(a) for all a G (0,1], following from Lemma 2.1 in [14]. More generally 
ERR/FDR = {E[V}/E[R}} /E[V/R\R> 0] provided Pr(i? > 0) > 0. Asymp- 
totically as m — ► oo, if Pr(i? > 0) — > 1 then E [V/R\R > 0] ~ E [V/R]; if fur- 
thermore E[V/R] ~ E[V]/E{R], then ERR/ FDR — > 1. Identifying reasonable 
sufficient (and necessary) conditions for E [V/R] ~ E[V]/E[R] to hold remains an 
open problem at this point. 

Analogous to the relationship between FDR and pFDR, define the positive ERR, 
pERR := E[V]/E[R}. Both quantities are well-defined provided Pr(i? > 0) > 0. 
The relationship between pERR and pFDR is the same as that between ERR and 
FDR described above. 

The error behavior of a given multiple test procedure can be investigated in 
terms of either FDR (pFDR) or ERR (pERR). The ratio pERR = E[V]/E[R] can 
be handled easily under arbitrary dependence among the tests because E[V] and 
E[R] are simply means of sums of indicator random variables. The only possible 
challenging component in ERR(a) is Pr(R(a) > 0) = Pr(Pi ;TO < a); some as- 
sumptions on the dependence among the tests has to be made to obtain a concrete 
analytical form for this probability, or an upper bound for it. Thus, as demonstrated 
in Section 4, ERR is an error measurement that is easier to handle than FDR under 
more complex and application-pertinent dependence among the tests, in assessing 
analytically the error properties of a multiple hypothesis testing procedure. 

A fine technical point is that FDR (pFDR) is always well-defined and ERR 
(pERR) is always well-defined under the convention a ■ = for a e [— oo,+oo]. 
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Compared to FDR (pFDR), ERR (pERR) is slightly less intuitive in interpretation. 
For example, FDR can be interpreted as the expected proportion of false positives 
among all positive findings, whereas ERR can be interpreted as the proportion of 
the number of false positives expected out of the total number of positive findings 
expected. Nonetheless, ERR (pERR) is still of practical value given its close rela- 
tionship to FDR (pFDR), and is more convenient to use in analytical assessments 
of a multiple test procedure. 

3. Estimation of the proportion of null hypotheses 

The proportion of the true null hypotheses ttq is an important parameter in all 
multiple test procedures. A delicate component in the control or estimation of 
FDR (or ERR) is the estimation of tt . The cdf F m (t) = ir Q t + (1 - n )H m (t), 
t £ [0, 1], along with the fact that the EDF F m is its unbiased estimator provides a 
clue for estimating 7r . Because for any t E (0, 1) n = [H m {t) — F m (t)]/[H m (t) — t], 
a plausible estimator of ttq is 

A-F m (t ) 
K0= A-to 

for properly chosen A and to. Let Q m (u) := i ;l m 1 (u), u £ [0,1] be the quantile 
function of F m and let Q m (u) := (u) := inf{ir: F m (x) > u} be the empirical 
quantile function (EQF), then n = [H m (Q m (u)) - u}/[H m (Q m (u)) - Q m {u)}, for 
u G (0, 1), and with Ai and uq properly chosen 

Ai - u 

7T = - 

Ai - Q m (uo) 

is a plausible estimator. The existing 7r estimators take either of the above repre- 
sentations with minor modifications. 

Clearly it is necessary to have Ai > u a for a meaningful estimator. Because 
Qm{ u o) < M o by the stochastic order assumption [cf. H2.1[) ]. choosing Ai too close 
to uo will produce an estimator much biased downward. Benjamini and Hochberg 
([3]) use the heuristic that if uq is so chosen that all P values corresponding to 
the alternative hypotheses concentrate in [0, Q m (uo)] then H m {Q m {uo)) = 1; thus 
setting Ai — 1. Storey ([29]) uses a similar heuristic to set A = 1. 

3.1. Existing estimators 

Taking a graphical approach Schweder and Spj0tvoll [27] propose an estimator of 
mo as m o = m (l — F m (X)) / (1 — A) for a properly chosen A; hence a corresponding 
estimator of 7r is n = m n /m = (1 — F m (A))/ (1 — A). This is exactly Storey's 
( [29] ) estimator. Storey observes that A is a tuning parameter that dictates the bias 
and variance of the estimator, and proposes computing ttq on a grid of A values, 
smoothing them by a spline function, and taking the smoothed 7To at A = 0.95 as 
the final estimator. Storey et al. ([31]) propose a bootstrap procedure to estimate 
the mean-squared error (MSE) and pick the A that gives the minimal estimated 
MSE. It will be seen in the simulation study (Section 5) that this estimator tends 
to be biased downward. 
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Approaching to the problem from the quantile perspective Benjamini and Hoch- 
berg ([3]) propose mo = min{l + (m + 1 — — Pj-.m), m } f° r a properly chosen 
j; hence 



1 ■ " 



7Tq = mm < — 



1 — P 

1 J J : m 



1 — j'/m + 1/m 



1 



The index j is determined by examining the slopes Si = (1 — Pi-. m )j {m + 1 — i), 
i = 1, . . . , m, and is taken to be the smallest index such that Sj < Sj—i. Then 
mo = min{l + 1 j Sj , to}. It is not difficult to see why this estimator tends to be too 
conservative (i.e., too much biased upward): as to gets large the event {Sj < Sj-i} 
tends to occur early (i.e., at small j) with high probability. By definition, Sj < Sj-i 
if and only if 

1 Pj-.m 1 Pj — l:m 



if and only if 



Thus, as to — > oo, 



to + 1 — j to + 2 — j 



1 TO + 1 — J 

i:m > to + 2 - j + m + 2-j j - 



Pr (S, < Sj-!) = Pr (p J:m > — I ; + m + \ J - Pj- 1:m ) — ► 1, 

\ to+2-j to+2-j j / 

for fixed or small enough j satisfying j/m — >5e [0, 1). The conservativeness will 
be further demonstrated by the simulation study in Section 5. 

Recently Mosig et al. ([21]) proposed an estimator of to by a recursive algorithm, 
which is clarified and shown by Nettleton and Hwang [22] to converge under a fixed 
partition (histogram bins) of the P value order statistics. In essence the algorithm 
searches in the right tail of the P value histogram to determine a "bend point" 
when the histogram begins to become flat, and then takes this point for A (or j). 

For a two-stage adaptive control procedure Benjamini et al. ([5]) consider an 
estimator of Too derived from the first-stage FDR control at the more conservative 
q/{\ + q) level than the targeted control level q. Their simulation study indicates 
that with comparable bias this estimator is much less variable than the estimators 
by Benjamini and Hochberg [3] and Storey et al. [31], thus possessing better accu- 
racy. Recently Langaas et al. ([19]) proposed an estimator based on nonparametric 
estimation of the P value density function under monotone and convex contraints. 



3.2. An estimator by quantile modeling 

Intuitively, the stochastic order requirement in the distributional model l|2.1|l im- 
plies that the cdf F m (-) is approximately concave and hence the quantile function 
Qm(') is approximately convex. When there is a substantial proportion of true null 
and true alternative hypotheses, there is a "bend point" r m S (0, 1) such that Q m (-) 
assumes roughly a nonlinear shape in [0, r m ], primarily dictated by the distributions 
of the P values corresponding to the true alternative hypotheses, and Q m (-) is es- 
sentially linear in [r m , 1], dictated by the U(0, 1) distribution for the null P values. 
The estimation of ttq can benefit from properly capturing this shape characteristic 
by a model. 

Clearly ir < [1 - r m ]/[H m (Q m (r m )) - Q m (r m )]. Again heuristically if all P 
values corresponding to the alternative hypotheses concentrate in [0, Q m (r m )], then 
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H m (Q m (T m )) — 1- A strategy then is to construct an estimator of Q m {-), Q m {-), 
that possesses the desirable shape described above, along with a bend point r m , 
and set 

(3.1) % ' ' ' ? 



which is the inverse slope between the points (r m , Q m (r m )) and (1, 1) on the unit 
square. 

Model (2.1) implies that Q m { ) is twice continuously differentiable. Taylor ex- 
pansion at t = gives Q m (t) — q m (0)t + \q' m {£,t)t 2 for t close to and < £t < t, 
where q m {') is the first derivative of Q m (-), i-e., the quantile density function (qdf), 
and q' m {-) is the second derivative of Q m {-). This suggests the following defini- 
tion (model) of an approximation of Q m by a convex, two-piece function joint 
smoothly at r m . Define Q (t) := mm{Q m (t),t}, t £ [0,1], define the bend point 
T m := argmaxjji — Q (t)} and assume that it exists uniquely, with the convention 
that r m = if Q m (F) = t for all t £ [0, 1]. Define 



(3.2) Q^(t;7,o,d,6i,6o,7V, 
where 



a< 7 + dt, < t < t„ 
b Q + bit, t > T m 



bi = [1 - Qm(r m )] / (1 - r m ) 

6 = 1 - &i = [Q m (r m ) - r m ] / (1 - r m ) , 

and 7, a and d are determined by minimizing 7, a, d, 61, 607 T m) — <5m( , )l|i 

under the following constraints: 

7 > 1, a>0, 0<d<l 
7 = a = 1, d = 0if and only if r m = 
ot^ + dr m = bo + b\T rn (continuity at r m ) 
a 7 T m~ 1 + d = 61 (smoothness at r m ). 

These constraints guarantee that the two pieces are joint smoothly at r m to produce 
a convex and continuously differentiable quantile function that is the closest to Q rn 
on [0, 1] in the L 1 norm, and that there is no over-parameterization if Q m coincides 
with the 45-degree line. Q* n will be called the convex backbone of Q m - 

The smoothness constraints force a, d and 7 to be interdependent via bo, 61 and 
r rn . For example, 

a = 0(7) = -b Q / [(7 - l)r m ] (for 7 > 1) 
d = d(7) = h - a(7)7^r 1 - 

Thus the above constrained minimization is equivalent to 
min 7 \\Q* m (-; 7, a(^), d(j), b%, b 

(3.3) 



subject to 



7 > 1, 0(7) > 0, < d( 7 ) < 1 

7 = a = 1, d = if and only if r m = 0. 



An estimator of ttq is obtained by plugging an estimator of the convex back- 
bone Q* m , Q* m , into (|3.1(l . The convex backbone can be estimated by replacing Q. m 
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with the EQF Q m in the above process. However, instead of using the raw EQF, 
the estimation can benefit from properly smoothing the modified EQF Q (t) := 

min{Q m (t),t}, t <E [0, 1] into a smooth and approximately convex EQF, Q m (-)- This 
smooth and approximately convex EQF can be obtained by repeatedly smoothing 
the modified EQF Q (•) by the variation-diminishing spline (VD-spline; de Boor 
[7], P.160). Denote by Bj :t ,k the jth order-fc B spline with extended knot sequence 
t = ti, . . . , t n+ k (ti = . . .tf. = < tk+i < ■ ■ ■ < t n < t n+ i = . . . = t n+ k = 1) and 
t* := Yje=j+i — 1). The VD-spline approximation of a function h: [0, 1] — > M 

is defined as 

n 

(3.4) h(u) := Kt*)B jtt;k (u), u e [0, 1]. 

i=i 

The current implementation takes k = 5 (thus quartic spline for Q m and cubic 
spline for its derivative, <f m ), and sets the interior knots in t to the ordered unique 
numbers in {i, ^, ^} U {F m (t), < = 0.001,0.003,0.00625,0.01,0.0125,0.025, 
0.05, 0.1, 0.25}. The knot sequence is so designed that the variation in the quantile 
function in a neighborhood close to zero (corresponding to small P values) can be 
well captured; whereas the right tail (corresponding to large P values) is greatly 
smoothed. Key elements in the algorithm, such as the interior knots positions, 
the t* positions, etc., are illustrated in Figure 1. 

Upon obtaining the smooth and approximately convex EQF Q m (')) the convex 
backbone estimator Q„(-) is constructed by replacing Q m (-) with Q m (-) in (I3.3|l 
and numerically solving the optimization with a proper search algorithm. This 
algorithm produces the estimator 7To in 13. 1|) at the same time. 

Note that in general the parameters 7, a, d, bo, 61, 7To :— mo/m, and their 
corresponding estimators all depend on m. For the sake of notational simplic- 
ity this dependency has been and continues to be suppressed in the notation. 
Furthermore, it is assumed that linim^oo mo /to exists. For studying asymptotic 
properties, henceforth let {Pi, P2, ■ ■ ■} be an infinite sequence of P values, and let 

4. Adaptive profile information criterion 

4-1. The adaptive profile information criterion 

We now develop an adaptive procedure to determine a significance threshold for 
the HT(a) procedure. The estimation perspective allows one to draw an analogy 
between multiple hypothesis testing and the classical variable selection problem: 
setting 9i = 1 (i.e., rejecting the ith null hypothesis) corresponds to including the 
zth variable in the model. A traditional model selection criterion such AIC usually 
consists of two terms, a model-fitting term and a penalty term. The penalty term is 
usually some measure of model complexity reflected by the number of parameters to 
be estimated. In the context of massive multiple testing a natural penalty (complex- 
ity) measurement would be the expected number of false positives -E[V(a;)] = noma 
under model 1)2.1(1 . When a parametric model is fully specified, the model-fitting 
term is usually a likelihood function or some similar quantity. In the context of 
massive multiple testing the stochastic order assumption in model (|2.1|) suggests 
using a proper quantity measuring the lack-off-fit from f/(0, 1) in the ensemble dis- 
tribution of the P values on the interval [0,a]. Cheng et al. ([7]) considered such 
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a measurement that is an L 2 distance. The concept of convex backbone facilitates 
the derivation of a measurement more adaptive to the ensemble distribution of the 
P values. Given the convex backbone Q m {-) := Q m (-', 7, a, d, b\, bo, r m ) as defined 
in (3.2), the "model-fitting" term can be defined as the L 1 distance between Q* m {-) 
and uniformity on [0, a]: 



D 7 (a) := 



1/7 



a G (0,1]. 



The adaptivity is reflected by the use of the L 1 distance: Recall that the larger 
the 7, the higher concentration of small P values, and the norm inequality (Hardy, 
Littlewood, and Polya [16], P.157) implies that D 72 (a) > D 7l (a) for every a 6 (0, 1] 
if 72 > 7i- 

Clearly D 1 (a) is non-decreasing in a. Intuitively one possibility would be to 
maximize a criterion like D 7 (a) — \-K ma. However, the two terms are not on the 
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same order of magnitude when m is very large. The problem is circumvented by 
using l/Dj(a), which also makes it possible to obtain a closed- form solution to 
approximately optimizing the criterion. 

Thus define the Adaptive Profile Information (API) criterion as 



(4.1) API (a) :-- 



(t 



Mfdt 



-l/ 7 



A(m, 7To, d)miroa, 



for a £ (0,1) and Q* n {-) := Q^ n (-; , y,a,d,bi,bo,T m ) as defined in (|3.2|) . One seeks 
to minimize API (a) to obtain an adaptive significance threshold for the HT(a) 
procedure. 

With 7 > 1, the integral can be approximated by / "((1 — d)ty dt = (1 — d)(7 + 
l)- 1 ^ 1 . Thus 



API(a) » API(a) := (1 - dy 1 



v 7+l 



7 + 1 



-1/7 



A(m, 7To, d)miroa. 



Taking the derivative of API(-) and setting it to zero gives 



' (27+1)/7 = (1 - d)(7 + l)~ 1/7 ^rA(m, tto, d)m7r . 

7+1 



Solving for a gives 



a 



( 7+ l)(l+l/7) 



(1 - d)7T 7 



7/(27+1) 



[A(m,^,d)m]- 7/(27+1) , 



which is an approximate minimizer of API. Setting A(m, itq, d) = m /37r °/(l — d) and 
f3 = 2ir /j gives 



(7 + l)(i+i/7) 



7TQ7 



7/(27+1) 



-(1+2^/7)7/(27+1) 



This particular choice for A is motivated by two facts. When most of the P values 
have the U(0, 1) distribution (equivalently, 7To ~ 1), the d parameter of the con- 
vex backbone can be close to 1; thus with 1 — d in the denominator, a* can be 
unreasonably high in such a case. This issue is circumvented by putting 1 — d in 
the denominator of A, which eliminates 1 — d from the denominator of a*. Next, it 
is instructive to compare a* with the Bonferroni adjustment a* Bon ^ — a jra for a 
pre-specified otQ. If 7 is large, then a* Bon f < a* rj 0(m -1 / 2 ) as m — ► 00. Although 
the derivation required 7 > 1, a* is still well defined even if t:q = 1 (implying 
7 = 1), and in this case a* — 4 1 / 3 m~ 1 is comparable to a* Bon j as m — ► 00. This 
in fact suggests the following significance threshold calibrated with the Bonferroni 
adjustment: 



(4.2) 



Vol : = 4~ 1/3 ( ^ ) a a* = A{^)m- B ^\ 



which coincides with the Bonferroni threshold a^m 1 when ir Q = 1, where 
A(x, y) := [ y /(4V3 x) ] aQ [ {y + l)(^/v) / {xy) ] y/{2y+1) 



(4.3) 



B(x,y) := (1 + 2x 2 /y)y/(2y+l). 
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The factor ao serves asymptotically as a calibrator of the adaptive significance 
threshold to the Bonferroni threshold in the least favorable scenario ttq = 1, i.e., all 
null hypotheses are true. Analysis of the asymptotic ERR of the HT(a* cal ) procedure 
suggests a few choices of ao m practice. 

4.2. Asymptotic ERR of HT(a* al ) 

Recall from ifTTjl that 

ERR(a) = [n a/F m (a)] Pr(P 1:m < a). 

The probability Pr(Pi :m < a) is not tractable in general, but an upper bound 
can be obtained under a reasonable assumption on the set P m of the m P val- 
ues. Massive multiple tests are mostly applied in exploratory studies to produce 
"inference-guided discoveries" that are either subject to further confirmation and 
validation, or helpful for developing new research hypotheses. For this reason often 
all the alternative hypotheses are two-sided, and hence so are the tests. It is in- 
structive to first consider the case of in two-sample t tests. Conceptually the data 
consist of ni i.i.d. observations on R m Xj = [Xn, JQ2, • ■ ■ , JQ m ], i — 1, . . . ,ni in 
the first group, and n<i i.i.d. observations Yj = \Yu, Yii, ■ ■ . , Yi m ], i — 1, ... ,712 in 
the second group. The hypothesis pair (H ok ,HAk) is tested by the two-sided two- 
sample t statistic Tfc = \T(Xk, 34, n±, 712)] based on the data X k = {Xi k , ■ . ■ , X ni k} 
and 34 = {Yxk, ■ ■ ■ ,Y n2 k}. Often in biological applications that study gene sig- 
naling pathways (see e.g., Kuo et al. [18], and the simulation model in Section 
5), Xik and Xih> (i = l,...,ni) are either positively or negatively correlated 
for certain k ^ fc', and the same holds for Yik and Y^ (i — 1, ... ,71%). Such 
dependence in data raises positive association between the two-sided test statis- 
tics T k and T v so that Pr(T fc < t\T' k < t) > Pr(T k < t), implying Pi{T k < 
t, T w < t) > Pr(T fc < t)Pr(T fe / <*),*> 0. Then the P values in turn satisfy 
Pr(P fe > a, P fc ' > a) > Pr(P fc > a) Pr(P fe / > a), a G [0, 1]. It is straightforward to 
generalize this type of dependency to more than two tests. Alternatively, a direct 
model for the P values can be constructed. 

Example 4.1. Let J C {1, . . . , to} be a nonempty set of indices. Assume Pj = 
Pq 3 1 3 ' G J , where Po follows a distribution Fq on [0, 1], and Xj's are i.i.d. contin- 
uous random variables following a distribution H on [0, 00), and are independent 
of the P values. Assume that the Pi's for i £ J are either independent or related to 
each other in the same fashion. This model mimics the effect of an activated gene 
signaling pathway that results in gene differential expression as reflected by the P 
values: the set J represents the genes involved in the pathway, P represents the 
underlying activation mechanism, and Xj represents the noisy response of gene j 
resulting in Pj . Because Pi > a if and only if Xj < log a j log Po , direct calculations 
using independence of the Xj 's show that 

p {nw>^r-(n{-.<^})-.(^[[<^)] m ;, 

where \J\ is the cardinality J . Next 
jej j&J J0 



dF (t) = 



E 



H 



log a 
logP, 
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Finally Pr (P\j e j{Pj > a}) > T[j e jPi' {Pj > a) , following from Jensen's inequality. 

The above considerations lead to the following definition. 

Definition 4.1. The set of P values P m has the positive orthant dependence prop- 
erty if for any a £ [0, 1] 

(rn \ rn 

f){Pi>a} \ >Y[Pr(P>a). 
t=l / i=l 

This type of dependence is similar to the positive quadrant dependence intro- 
duced by Lehmann [20]. 

Now define the upper envelope of the cdf 's of the P values as 

F m (t) := max {£?,(*)}, t £ [0,1], 
where Gi is the cdf of Pi. If P m has the positive orthant dependence property then 

/ rn \ rn 

Pr (Pi :m < a) = 1 - Pr [f]{Pi > a} < 1-JJPr (P* > a) < 1 - (1 - F m (a)) ro , 




(4.4) ERR(a* al ) < ; , ? Qc °' 7 ; s [l-(l-F m (a^)) m ] . 

Because a* al — > as m — ► oo, the asymptotic magnitude of the above ERR can 
be established by considering the magnitude of F m (t m ) and H m (t m ) as t m — > 0. 
The following definition makes this idea rigorous. 

Definition 4.2. The set of m P values P m is said to be asymptotically stable as 
m — ► oo if there exists sequences {/3 m }, {r]m}, {V'm}, {Cm} and constants j3*, /?*, 
?y, tp* , ip*, and £ such that 

F m (t) ~ /3 m i"™ , ff m (t) ~ Vm* U , t — > 

and 

< #, < /3 m < /?* < oo, < 77 < ?7 m < 1 
< -0* < V'm < "0* < 00, < £ < £ m < 1 

for sufficiently large m. 

This definition essentially says that P m is regarded as asymptotically stable if 
the ensemble distribution functions F m {-) and H m {-) vary in the left tail similarly 
to Beta distributions. 

The following theorem establishes the asymptotic magnitude of an upper bound 
of ERR(a* cal ) - the ERR of applying a* al in the hard-thresholding procedure l|2.3|) . 

Theorem 4.1. Let ip*, £ TO) [3* , and 77 be as given in Definition 4-- 2, and let A(-, ■) 
and -B(-, •) be as defined in If the set of P values P m is asymptotically sta- 

ble and has the positive orthant dependence property for sufficiently large m, then 
ERR(a* al ) < *(a* oJ ), and *(a* a/ ) satisfies 

(a) if 7T = 1 for all m, lim^oo *(a* a/ ) = 1 - e~"° ; 

(b) i/7r < 1 arid A(7r ,7) < ^4 < 00 /or some A and sufficiently large m, then 

1 - 7TQ 
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Proof. See Appendix. □ 

There are two important consequences from this theorem. First, the level «o can 
be chosen to bound ERR (and FDR) asymptotically in the least favorable situation 
7To = 1. In this case both ERR and FDR are equal to the family- wise type-I error 
probability. Note that 1 — e~ a ° is also the limiting family-wise type-I error probabil- 
ity corresponding to the Bonferroni significance threshold aom . In this regard the 
adaptive threshold a* cal is calibrated to the conservative Bonferroni threshold when 
7To = 1. If one wants to bound the error level at ol\, then set «o = — log(l — oti). 
Of course «o ~ ot\ for small a\\ for example, «o ~ 0.05129,0.1054,0.2231 for 
a% = 0.05, 0.1, 0.2 respectively. 

Next, Part (b) demonstrates that if the "average power" of rejecting the false 
null hypotheses remains visible asymptotically in the sense that £ m < £ < 1 for 
some £ and sufficiently large to, then the upper bound 

*(a* cal ) c -^V*" 1 M7ro,7)] 1 ~ Cm m-^ B ^ — 0; 

1 — 7T 

therefore ERR(a* al ) diminishes asymptotically. However, the convergence can be 
slow if the power is weak in the sense £ ps 1 (hence H m (-) is close to the U(0, 1) 
cdf in the left tail). Moreover, 4" can be considerably close to 1 in the unfavorable 
scenario ttq ss 1 and £ ~ 1. On the other hand, increase in the average power in the 
sense of decrease in £ makes $ (hence the ERR) diminishes faster asymptotically. 

Note from (|4.3|) that as long as ttq is bounded away from zero (i.e., there is 
always some null hypotheses remain true) and 7 is bounded, the quantity ^(710,7) 
is bounded. Because the positive ERR does not involve the probability Pr(i? > 0), 
part (b) holds for pERR(a* al ) under arbitrary dependence among the P values 
(tests). 



4-3. Data-driven adaptive significance threshold 

Plugging 7? and 7 generated by optimizing l|3.3JI into (|4.2(l produces a data-driven 
significance threshold: 

(4.5) al al :=A{^) m - B ^). 

Now consider the ERR of the procedure HT(a* al ) with a* cal as above. Define 

ERR* ■= l^iHj Pr (R(Kai) > o) . 

The interest here is the asymptotic magnitude of ERR* as m — ► 00. A major 
difference here from Theorem 4.1 is that the threshold a* al is random. A similar 
result can be established with some moment assumptions on A(ttq, 7), where A{-, •) 
is defined in Ij4.3|l and ttq, 7 are generated by optimizing l|3.3|l . Toward this end, still 
assume that P m is asymptotically stable, and let r\ m ^ 77, and £ m be as in Definition 
4.2. Let v m be the joint cdf of pro, 7], and let 



a m := / A(s,t) Vm dv m (s,t) 
ai m ■= I A{s,t)dv m {s,t) 

A(s,t)^du m (s,t). 



a 2 



m •- 
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All these moments exist as long as 7?o is bounded away from zero and 7 is bounded 
with probability one. 

Theorem 4.2. Suppose that P m is asymptotically stable and has the positive 
orthant dependence property for sufficiently large m. Let (3* , rj, ip*, and £ TO be 
as in Definition 4.2. If a m , a\ m and a2 m all exist for sufficiently large m, then 
ERR* < \I> TO and there exist S m £ [77/3,77], e m £ [1/3, 1], and e' m £ [£ m /3,£m] such 
that as m > 00 



( K([3*,a m ,5 m ), if 7T = 1, all m 

fetS)^ 1 ^^' 1 / ^ < efficiently large to, 

where K(f3* ,a m ,S m ) =1- (l- f3*a m m~ Sm ) m 

Proof. See Appendix. □ 

Although less specific than Theorem 4.1, this result still is instructive. First, 
if the "average power" sustains asymptotically in the sense that £ TO < 1/3 so that 
e m > e' m for sufficiently large m, or if linim^oo £ TO = £ < 1/3, then ERR* diminishes 
as m — ► 00. The asymptotic behavior of ERR* in the case of £ > 1/3 is indefinite 
from this theorem, and obtaining a more detailed upper bound for ERR* in this 
case remains an open problem. Next, ERR* can be potentially high if no = 1 always 
or 7To ~ 1 and the average power is weak asymptotically. The reduced specificity in 
this result compared to Theorem 4.1 is due to the random variations in A(no, 7) 
and B(tto,j), which are now random variables instead of deterministic functions. 
Nonetheless Theorem 4.2 and its proof (see Appendix) do indicate that when ttq w 1 
and the average power is weak (i.e., H m {-) is small), for sake of ERR (and FDR) 
reduction the variability in A(7ro,7) and 1? (770,7) should be reduced as much as 
possible in a way to make S m and e m as close to 1 as possible. In practice one 
should make an effort to help this by setting ttq and 7 to 1 when the smoothed 
empirical quantile function Q m is too close to the U(0, 1) quantile function. On the 
other hand, one would like to have a reasonable level of false negative errors when 
true alternative hypotheses do exist even if t:q 1; this can be helped by setting 
ao at a reasonably liberal level. The simulation study (Section 5) indicates that 
cto = 0.22 is a good choice in a wide variety of scenarios. 

Finally note that, just like in Theorem 4.1, the bound when ttq < 1 holds for 
the positive ERR pERR* := E[V(a* al )]/ E[R(a* al )] under arbitrary dependence 
among the tests. 



5. A Simulation study 

To better understand and compare the performance and operating characteristics of 
HT(a* al ), a simulation study is performed using models that mimic a gene signaling 
pathway to generate data, as proposed in [7]. Each simulation model is built from 
a network of 7 related "genes" (random variables), Xq, X\, X2, A3, A4, A190, and 
A221, as depicted in Figure 2, where Xq is a latent variable. A number of other 
variables are linear functions of these random variables. 

Ten models (scenarios) are simulated. In each model there are m random vari- 
ables, each observed in K groups with independent observations in the /cth 
group (k = 1, . . . K). Let /j^ be the mean of variable i in group k. Then m AN OVA 
hypotheses, one for each variable {H oi : fin — ■ ■ ■ — fine, i — 1 • ■ • , to), are tested. 
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Fig 2. A seven-variable framework to simulate differential gene expressions in a pathway. 



Table 2 

Relationships among Xo, X\ X4, Xigo and X221 '■ Xi k j denote the jth observation 
of the ith variable in group k; N(0,cr 2 ) denotes normal random noise. The index j 
always runs through 1,2,3 

X ij i.i.d. N(0, a 2 ); X okj i.i.d. AT (8, a 2 ), k = 2, 3, 4 

X lkj = X 0k //4 + N(0, 0.0784) (Xi is highly correlated with X ; a = 0.28.) 
X 2kj =X m +N{0,o- 2 ), fc = l,2; X 23j =X 03j +6 + ^(0, <r 2 ); X 24j =X 04j +14+JV(0, <x 2 ) 
^3fc 3 = *2kj + N(0, o- 2 ), k = 1, 2, 3,4 

X 4fcj =X 2 )y+A''(0,<r 2 ), fc= 1,2; X43 J =X 23j -6 + Af(0, ( T 2 ); X 44j = X 24) - -8+iV(0, a 2 ) 
Xigo.lj = *3ij + 24 + N(0, or 2 ); X w0 ,2j = X 32j + X 42j + N(0, a 2 ); 

Xi9o, 3j = X 33j - X 43j - 6 + 2V(0, <r 2 ); X 190l 4j = X 34j - 14 + AT(0, (J 2 ) 
X 2 2l,fcj = X 3kj + 24 + JV(0, <r 2 ), k = 1, 2; 

*221,3j = ^33j ~ ^43j + N(0, ^ 2 ); X221,4j = X 34j + 2 + iV(0, <T 2 ) 



Realizations are drawn from Normal distributions. For all ten models the num- 
ber of groups K = 4 and the sample size rife = 3, k = 1, 2, 3, 4. The usual one-way 
ANOVA F test is used to calculate P values. Table 2 contains a detailed description 
of the joint distribution of Xq, . . . , X4, A190 and X221 in the ANOVA set up. The 
ten models comprised of different combinations of m, 7r , and the noise level a are 
detailed in Table 3, Appendix. The odd numbered models represent the high-noise 
(thus weak power) scenario and the even numbered models represent the low-noise 
(thus substantial power) scenario. In each model variables not mentioned in the 
table are i.i.d. N(0,a 2 ). Performance statistics under each model are calculated 
from 1,000 simulation runs. 

First, the ir estimators by Benjamini and Hochberg [3], Storey et al. [31], and 
(I3.1|l are compared on several models. Root mean square error (MSE) and bias are 
plotted in Figure 3. In all cases the root MSE of the estimator (|3.1() is either the 
smallest or comparable to the smallest. In the high noise case (er = 3) Benjamini and 
Hochberg's estimator tends to be quite conservative (upward biased) , especially for 
relatively low true no (0.83 and 0.92, Models 1 and 3); whereas Storey's estimator 
is biased downward slightly in all cases. The proposed estimator l|3.1fl is biased in 
the conservative direction, but is less conservative than Benjamini and Hochberg's 
estimator. In the low noise case (a — 1) the root MSE of all three estimators 
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Models 1 ,3,5,9; sigma=3 




0.82 0.86 



0.90 0.94 0.98 
true piO 



Models 2,4,6,10; sigma=1 




0.82 0.86 0.90 0.94 0.98 
true piO 



Models 1 ,3,5,9; sigma=3 



Models 2,4,6,10; sigma=1 




0.82 0.86 0.90 0.94 0.98 
true piO 




0.82 0.86 0.90 0.94 0.98 
true piO 



Fig 3. Root MSE and bias of the no estimators by Benjamini and Hochberg [3] (circle), Storey 
et al. [31] (triangle), and 13.11 (diamond) 



and the bias of the proposed and the Benjamini and Hochberg's estimators are 
reduced substantially while the small downward bias of Storey's bootstrap estimator 
remains. Overall the proposed estimator i|3.1|) outperforms the other two estimators 
in terms of MSE and bias. 

Next, operating characteristics of the adaptive FDR control ([3]) and q- value 
FDR control ([31]) at the 1%, 5%, 10%, 15%, 20%, 30%, 40%, 60%, and 70% 
levels, the criteria API (i.e., the HT(a* al ) procedure) and I p ([7]), are simulated 
and compared. The performance measures are the estimated FDR (FDR) and the 
estimated false nondiscovery proportion (FN DP) defined as follows. Let mi be the 
number of true alternative hypotheses according to the simulation model, let Ri be 
the total number of rejections in simulation trial I, and let Si be the number of 
correct rejections. Define 

FDR = ^ £*=i° I(Ri > 0)(Ri - SO/Rt 
FNDP = ^ EHVi ~ SO/mi, 

where /(•) is the indicator function. These are the Monte Carlo estimators of the 
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FDR and the FNDP := E [mi - S] /mi (cf. Table 1). In other words FNDP is the 
expected proportion of true alternative hypotheses not captured by the procedure. 
A measurement of the average power is 1 — FNDP. 

Following the discussions in Section 4, the parameter a required in the API 
procedure should be set at a reasonably liberal level. A few values of «o were 
examined in a preliminary simulation study which suggested that ao = 0.22 is a 
level that worked well for the variety of scenarios covered by the ten models in 
Table 3, Appendix. 

Results corresponding to ao = 0.22 are reported here. The results are first sum- 
marized in Figure 4. In the high noise case (a = 3, Models 1, 3, 5, 7, 9), compared 
to I p , API incurs no or little increase in FNDP but substantially lower FDR when 
7r is high (Models 5, 7, 9), and keeps the same FDR level and a slightly reduced 
FNDP when ttq is relatively low (Models 1,3); thus API is more adaptive than I p . 
As expected, it is difficult for all methods to have substantial power (low FNDP) in 
the high noise case, primarily due to the low power in each individual test to reject 
a false null hypothesis. For the FDR control procedures, no substantial number of 
false null hypotheses can be rejected unless the FDR control level is raised to a 
relatively high level of > 30%, especially when 7To is high. 

In the low noise case (a = 1, Models 2, 4, 6, 8, 10), API performs similarly to 
I p , although it is slightly more liberal in terms of higher FDR and lower FNDP 
when 7To is relatively low (Models 2, 4). Interestingly, when 7To is high (Models 6, 8, 
10), FDR control by g-value (Storey et al. [31]) is less powerful than the adaptive 
FDR procedure (Benjamini and Hochberg [3]) at low FDR control levels (1%, 5%, 
and 10%), in terms of elevated FNDP levels. 

The methods are further compared by plotting FNDP vs. FDR for each model 
in Figure 5. The results demonstrate that in low-noise (model 2, 4, 6, 8, 10) and 
high-noise, high-7r (models 5, 7, 9) cases, the adaptive significance threshold deter- 
mined from API gives very reasonable balance between the amounts of false positive 
and false negative errors, as indicated by the position of the diamond (FNDP vs. 
FDR of API) relative to the curves of the FDR-control procedures. It is noticeable 
that in the low noise cases the adaptive significance threshold corresponds well to 
the maximum FDR level for which there is no longer substantial gain in reducing 
FNDP by controlling the FDR at higher levels. There is some loss of efficiency for 
using API in high- noise, low-7To cases (model 1,3)- its FNDP is higher than the 
control procedures at comparable FDR levels. This is a price to pay for not using 
a prespecified, fixed FDR control level. 

The simulation results on API are very consistent with the theoretical results 
in Section 4. They indicate that API can provide a reasonable, data-adaptive sig- 
nificance threshold that balances the amounts of false positive and false negative 
errors: it is reasonably conservative in the high ttq and high noise (hence low power) 
cases, and is reasonably liberal in the relatively low tt and low noise cases. 

6. Concluding remarks 

In this research an improved estimator of the null proportion and an adaptive signif- 
icance threshold criterion API for massive multiple tests are developed and studied, 
following the introduction of a new measurement of the level of false positive er- 
rors, ERR, as an alternative to FDR for theoretical investigation. ERR allows for 
obtaining insights into the error behavior of API under more application-pertinent 
distributional assumptions that are widely satisfied by the data in many recent 



70 



C. Cheng 



o 

\ 

\ 

♦. • 

\ / 


9 

\ 

O 

\ 

o • 

*■ / 


o 

o 




• 

» 

/ 

• 


• 

• 

/ 

• 






V 

/ o 


• • 




• 


• 


• 

o 

o 

• 



1 20 40 60 100 
FDR control level in % 
BH AFDR 



1 20 40 60 100 API Ip 
FDR control level in % 
q-value 



1 20 40 60 100 
FDR control level in % 
BH AFDR 



1 20 40 60 100 
FDR control level in % 
q-value 



API Ip 




1 20 40 60 100 
FDR control level in % 
BH AFDR 



1 20 40 60 100 API Ip 
FDR control level in % 
q-value 



• 

* 

/ 

• 


• 

• 

/ 

• 






o • 


• o 

o 

• 



1 20 40 60 100 1 20 40 60 100 

FDR control level in % FDR control level in % 
BH AFDR q-value 



API Ip 




1 20 40 60 100 
FDR control level in % 
BH AFDR 



1 20 40 60 100 API Ip 
FDR control level in % 
q-value 




1 20 40 60 100 
FDR control level in % 
BH AFDR 



1 20 40 60 100 API Ip 
FDR control level in % 
q-value 



Model 7 Model 8 




1 20 40 60 100 1 20 40 60 100 API Ip 1 20 40 60 100 1 20 40 60 100 API Ip 

FDR control level in % FDR control level in % FDR control level in % FDR control level in % 

BH AFDR q-value BH AFDR q-value 



o. 

o. 
• 


Q. 


• 


/ 

• 

/ 


.'• 

7 

- • 


• 


• 


,7 




• 


• 




.•• 


/ 

*»•••* 





1 20 40 60 100 
FDR control level in % 
BH AFDR 



1 20 40 60 100 API Ip 
FDR control level in % 
q-value 





\ 

\ / 




1 / 

V 
.• 


♦ /* 

*»■•»-. 

4 O O 
M* 


O 

O 

• • 



1 20 40 60 100 1 20 40 60 100 API Ip 
FDR control level in % FDR control level in % 
BH AFDR q-value 



Fig 4. Simulation results on the rejection criteria. Each panel corresponds to a model configu- 
ration. Panels in the left column correspond to the "high noise" case a = 3, and panels in the 
right column correspond to the "low noise" case a = 1. The performance statistics FDR (bullet) 
and FN DP (diamond) are plotted against each criteria. Each panel has three sections. The left 
section shows FDR control with the Benjamini & Hochberg [3] adaptive procedure (BH AFDR), 
and the middle section shows FDR control by q-value, all at the 1%, 5%, 10%, 15%, 20%, 30%, 
40%, 60%, and 70% levels. The right section shows FDR and FNDP of API and I p . 
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Fig 5. FN DP vs. FDR for Benjamini and Hochberg [3] adaptive FDR control (solid line and 
bullet) and q-value FDR control (dotted line and circle) when FDR control levels are set at 1%, 
5%, 10%, 15%, 20%, 30%, 40%, 60%, and 70%. For each model FND~P vs. FDR of the adaptive 
API procedure occupies one point on the plot, indicated by a diamond. 
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applications. Under these assumptions, for the first time the asymptotic ERR level 
(and the FDR level under certain conditions) is explicitely related to the ensemble 
behavior of the P values described by the upper envelope cdf F m and the "average 
power" H m . Parallel to positive FDR, the concept of positive ERR is also useful. 
Asymptotic pERR properties of the proposed adaptive method can be established 
under arbitrary dependence among the tests. The theoretical understanding pro- 
vides cautions and remedies to the application of API in practice. 

Under proper ergodicity conditions such as those used in [31, 14], FDR and ERR 
are equivalent for the hard-thresholding procedure l|2.3[l : hence Theorems 4.1 and 
4.2 hold for FDR as well. 

The simulation study shows that the proposed estimator of the null proportion 
by quantile modeling is superior to the two popular estimators in terms of reduced 
MSE and bias. Not surprisingly, when there is little power to reject each individual 
false null hypothesis (hence little average power), FDR control and API both incur 
high level of false negative errors in terms of FNDP. When there is a reasonable 
amount of power, API can produce a reasonable balance between the false positives 
and false negatives, thereby complementing and extending the widely used FDR- 
control approach to massive multiple tests. 

In exploratory type applications where it is desirable to provide "inference-guided 
discoveries" , the role of ao is to provide a protection in the situation where no true 
alternative hypothesis exits (tt^ = 1). On the other hand it is not advisable to 
choose the significance threshold too conservatively in such applications because 
the "discoveries" will be scrutinized in follow up investigations. Even if setting 
ao — 1 the calibrated adaptive significance threshold is m _1 , giving the limiting 
ERR (or FDR, or family-wise type-I error probability) 1 — e _1 w 0.6321 when 

TT = 1. 

At least two open problems remain. First, although there has been empirical 
evidence from the simulation study that the tto estimator Ij3.1|) outperforms the 
existing ones, there is lack of analytical understating of this estimator, in terms of 
MSE for example. Second, the bounds obtained in Theorems 4.1 and 4.2 are not 
sharp, a more detailed characterization of the upper bound of ERR* (Theorem 4.2) 
is desirable for further understanding of the asymptotic behavior of the adaptive 
procedure. 



Appendix 



Proof of Theorem 4-1- For (a), from (|4.3|l and 14. 4|) . if ttq — 1 for all m, then the 
first factor on the right-hand side of (|4.4|) is 1, and the second factor is now equal 
to 

1 - (1 - a* cal ) m = 1 - (1 - A(l, l)m- s(M) ) m = 1 - (1 - aom- 1 )" 1 ► 1 - e~ a ° 

because A(l, 1) = ao and B(l, 1) = 1. For (b), first 

1 - (1 - F m (a* cal )) m ~ 1 - (1 - (3 m a2r) m < 1 - (1 - P*Aa m^ B ^^) m := e m , 

and e m ~ 1 - exp (-^Aaom 1 ^ 13 ^ ^) — > 1 because £(710,7) < (7 + 2)/(2 7 + 
1) < 1 so that 775(7^,7) < 1. Next, let u m := tf ^lCl ' 
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Then 



ir uj m ttq , [^(Tro^)] 1 c " 

< V* 



(l-U)B(wo,7) 



^oa* cal + (l-n )H m (a* cal ) 1 - tt 1 - tt * m 



for sufficiently large m. Multiplying this upper bound and the limit of e m gives (b). 

□ 



Proof of Theorem 4-2. First, for sufficiently large m, 



Pr (R(a* cal ) > 0) < 1 - / /? m A( a , t)*» m -*»*(«.*) 



du m (s,t) 



< 1 - 



1-/3*/ A(s,i)^m-^ s '*W m (M) 



Because 1/3 < B(tt ,j) < 1 with probability 1, so that m' 7 ' < mP 3 ^ ' 1 ) < m" 1 '/ 3 
with probability 1, by the mean value theorem of integration (Halmos [15], P.114), 
there exists some £ rn £ [m~ J ', to~ ??/ ' 3 ] such that 



/ A(s, tf^m-^^dvrnis, t) = £„ 

JR2 



and £ m can be written equivalently as m Sm for some 5 TO £ [77/3,77], giving, for 
sufficiently large m, 

Pr (i*(S* a ,) > 0) < 1 - (1 - /3*a m m-^) m . 

This is the upper bound ^ m of ERR* if 7r = 1 for all m because now l^(S* a; ) = 
fl(«* ai ) with probability 1. Next, 



E [V(a* cal )] =E[E [V(a* cal )\a* cal ]} = E [n a* cal ] = tt q f - 



A(s,t) 



B(s,t) 



dv m (s,t). 



Again by the mean value theorem of integration there exists e m £ [1/3, 1] such that 
E \y{®*cal)] = noai m m~ £m . Similarly, 

E \H m {a* cal )] ~ i> m f A(s,t)^m-^ B ^du m (s,t) > ^a 2m m-^ 

for some e' m £ [£ m /3, £ m ]. Finally, because 

E [R(a* cal )] =E[E [R{Kai)\Kai]] = ^E [V(a* cal )] + (1 - tt )E [H m (a* cal )] , 
if 7To < 1 for sufficiently large m, then 

n E [a* 



ERR* < 



< 



1- (l-p*a m m- 5 ™y 



call 



1 - 7TQ V a 2 



TTQ I aim \ {Em _ E > m) 



(1 - n )E [H m (a* al )] 

(3* a m m- 5 ™) 7 



□ 
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Table 3 

Ten models: Model configuration in terms of (m, mi , cr) and determination of true alternative 

hypotheses by X\, . . . , X4, -X190 and -X221, where mi is t/ie number of true alternative 
hypotheses; hence itq = 1 — mi/m. n^ = 3 and ii" = 4 /or aZZ models. N(Q,a 2 ) denotes normal 

random noise 



Model 



mi cr True Ha's in addition to X\, . . . X4, Xigo, X221 



3000 500 



Xi 


= Xi + 


iV(0, cr^), i = 


5, . . . , 16 


X, 


= -Xi 


+ jV(0,<T 2 ), i 


= 17, ...,25 


x, 


= x 2 + 


N(0,a 2 ), i = 


26, . . . , 60 


x, 


= -x 2 


+ N(0,o- 2 ), i 


= 61,..., 70 


x, 


= x 3 + 


JV(0,a 2 ), i = 


71,..., 100 


x, 


= -x 3 


+ N(0,a 2 ), i 


= 101,..., 110 


x, 


= x 4 + 


N(0,a 2 ), i = 


111, . . . , 150 


x, 


= -x 4 


+ N(0,a 2 ), i 


= 151, . . . , 189 


x, 


= x wo 


+ N(0,a 2 ), i 


= 191, ... ,210 


x, 


= -X190 + N(0,a 2 ), 


i = 211, ... ,220 


x, 


= X221 


+ N(0,a 2 ), i 


= 222, ... ,250 


x, 


= 2X,_ 


250 + iV(0, a 2 


), i = 251, . . .,500 



2 


3000 


500 


3 


3000 


250 


4 


3000 


250 


5 


3000 


32 



the same as Model 1 

the same as Model 1 except only the first 250 are true Ha's 
the same as Model 3 



Xi 


= Xi + 


N(0,a 2 ), i = 


5,. ..,8 


x, 


= x 2 + 


N(0,a 2 ), i = 


9,..., 12 


x, 


= x 3 + 


N{0,cr 2 ), i = 


13, . . . , 16 


x, 


= x A + 


N(0,a 2 ), i = 


17, ... ,20 


x, 


= Xig 


+ N(0,cr 2 ), i 


= 191,..., 195 


x, 


= X221 


+ N(0,cr 2 ), i 


= 222, ... ,226 



6 


3000 


32 


1 


the same as Model 5 


7 


3000 


6 


3 


none, except Xi, ■ ■ .X±, Xigo, X221 


8 


3000 


6 


1 


the same as Model 7 


9 


10000 


15 


3 


Xi = Xi + N(0,a 2 ), i = 5,6 



10 



10000 



15 



X, = X 2 + N (0,cr 2 ), 
X l= X 3 + N{0,a 2 ), 
X l = X 4 + N{0,a 2 ), 
X 191 = Xi 90 + AT(0, c 
the same as Model 9 



7,8 
9,10 
11, 12 
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